28 research outputs found

    Análisis de métodos de parametrización y clasificación para la simulación de un sistema de evaluación perceptual del grado de afección en voces patológicas

    Voice quality assessment procedures based on subjective evaluation through acoustic perception by an expert are widespread. Among them, the GRBAS protocol is the most commonly used in clinical routine. However, this kind of estimation suffers from several problems, the first of which is that properly trained professionals are required to carry it out. Another drawback lies in the fact that, being a subjective assessment, multiple significant circumstances influence the evaluator's final decision, and in many cases there is inter-rater and intra-rater variability in the judgements. For these reasons, objective parameters that allow voice quality to be assessed and various pathologies to be detected are needed. The goal of this work is to compare the effectiveness of several techniques for computing parameters representative of the voice for use in the automatic classification of perceptual scales. The parameters analyzed include Mel-Frequency Cepstral Coefficients (MFCC), complexity measures and noise measures. In addition, a new set of features extracted from the Modulation Spectrum (MS), called Modulation Spectrum Centroids (MSC), is introduced. Specifically, the automatic detection of two of the five traits that make up the GRBAS scale, G and R, is analyzed. Throughout this document it is shown that the MSC features provide results similar to those of previously used techniques and, in some cases, improve classification effectiveness when combined with other parameters
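Since several of the abstracts in this listing rely on MFCC front-ends, a minimal from-scratch sketch of that computation may help: framing with a Hann window, power spectrum, triangular mel filterbank, log compression, and a DCT-II. This is a didactic sketch only; all parameter values and names are illustrative defaults, not the settings used in any of the papers above.

```python
import numpy as np

def mfcc(signal, sr, n_mfcc=13, n_fft=512, hop=256, n_mels=26):
    """Minimal MFCC extraction: framing -> power spectrum ->
    mel filterbank -> log -> DCT-II. Didactic sketch only."""
    # Frame the signal with a Hann window.
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frames.append(signal[start:start + n_fft] * np.hanning(n_fft))
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2        # (T, n_fft//2+1)

    # Triangular mel filterbank between 0 Hz and Nyquist.
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = mel_to_hz(np.linspace(0, hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    logmel = np.log(power @ fbank.T + 1e-10)                 # (T, n_mels)

    # DCT-II to decorrelate; keep the first n_mfcc coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n[None, :] + 0.5) * np.arange(n_mfcc)[:, None])
    return logmel @ dct.T                                    # (T, n_mfcc)
```

In practice, a library front-end would be used instead, but the pipeline stages are the same.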

    Tuning of modulation spectrum parameters for voice pathology detection

    Acoustic parameters are frequently used to assess the presence of pathologies in the human voice. Many of them have been shown to be useful, but in some cases their results can be optimized by selecting appropriate working margins. In this study, two indices obtained from Modulation Spectra, CIL and RALA, are described and tuned using different frame lengths and frequency ranges to maximize the AUC in normal versus pathological voice detection. After the tuning process, the AUC reaches 0.96 and 0.95 for CIL and RALA respectively, representing improvements of 16% and 12% with respect to the typical tuning based only on frame length selection
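The tuning criterion described above, maximising the AUC over candidate analysis settings, can be sketched as follows. The AUC is computed via the Mann-Whitney rank statistic; `score_fn` is a hypothetical callable (our own name, not from the paper) returning detector scores for the normal and pathological groups at a given frame length.

```python
import numpy as np

def auc(scores_normal, scores_pathological):
    """AUC as the probability that a randomly chosen pathological
    sample scores higher than a normal one (Mann-Whitney statistic)."""
    s_n = np.asarray(scores_normal, dtype=float)
    s_p = np.asarray(scores_pathological, dtype=float)
    # Compare every pathological score against every normal score;
    # ties contribute 0.5.
    greater = (s_p[:, None] > s_n[None, :]).sum()
    ties = (s_p[:, None] == s_n[None, :]).sum()
    return (greater + 0.5 * ties) / (s_p.size * s_n.size)

def tune(frame_lengths, score_fn):
    """Grid search: keep the frame length with the highest AUC."""
    results = {fl: auc(*score_fn(fl)) for fl in frame_lengths}
    return max(results, key=results.get), results
```

The same loop extends to a two-dimensional grid over frame length and frequency range.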

    Analysis of complexity and modulation spectra parameterizations to characterize voice roughness

    Disordered voices are frequently assessed by speech pathologists using acoustic perceptual evaluations. This may lead to problems due to the subjective nature of the process and to the influence of external factors which compromise the quality of the assessment. In order to increase the reliability of the evaluations, the design of new indicator parameters obtained from voice signal processing is desirable. With that in mind, this paper presents an automatic evaluation system which emulates perceptual assessments of the roughness level in human voice. Two parameterization methods are used: complexity, which has already been used successfully in previous works, and modulation spectra. For the latter, a new group of parameters is proposed: Low Modulation Ratio (LMR), Contrast (MSW) and Homogeneity (MSH). The tested methodology also employs PCA and LDA to reduce the dimensionality of the feature space, and GMM classifiers to evaluate the ability of the proposed features to distinguish the different roughness levels. An efficiency of 82% and a Cohen's Kappa Index of 0.73 are obtained using the modulation spectra parameters, while the complexity parameters reach 73% and 0.58 respectively. The obtained results indicate the usefulness of the proposed modulation spectra features for the automatic evaluation of voice roughness, which may lead to new parameters useful for clinicians
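The classification stage described above (dimensionality reduction followed by per-level GMMs) can be sketched with off-the-shelf components. This is a generic sketch of the technique, not the paper's implementation: class and parameter names are our own, PCA stands in for the PCA/LDA step, and a sample is assigned to the roughness level whose GMM gives the highest log-likelihood.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

class GMMLevelClassifier:
    """One GMM per perceptual level; predict the level whose GMM
    yields the highest log-likelihood after PCA reduction."""
    def __init__(self, n_components=2, n_pca=2):
        self.pca = PCA(n_components=n_pca)
        self.n_components = n_components
        self.gmms = {}

    def fit(self, X, y):
        Xr = self.pca.fit_transform(X)          # reduce feature space
        for level in np.unique(y):
            gmm = GaussianMixture(n_components=self.n_components,
                                  random_state=0)
            gmm.fit(Xr[y == level])             # model each level separately
            self.gmms[level] = gmm
        return self

    def predict(self, X):
        Xr = self.pca.transform(X)
        levels = sorted(self.gmms)
        ll = np.column_stack([self.gmms[l].score_samples(Xr) for l in levels])
        return np.array(levels)[ll.argmax(axis=1)]
```

LDA could replace PCA when labelled training data is available, as the abstract notes.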

    Automatic age detection in normal and pathological voice

    Systems that automatically detect voice pathologies are usually trained with recordings belonging to populations of all ages. However, such an approach might be inadequate because of the acoustic variations in the voice caused by the natural aging process. On top of that, elderly voices present some quality perturbations similar to those related to voice disorders, which makes the detection of pathologies more troublesome. With this in mind, the study of methodologies which automatically incorporate information about speakers' age, aiming at a simplification of the detection of voice disorders, is of interest. In this respect, the present paper introduces an age detector trained with normal and pathological voices, constituting a first step towards the study of age-dependent pathology detectors. The proposed system employs sustained vowels of the Saarbrücken database, from which two age groups are examined: adults and elders. Mel-frequency cepstral coefficients are used for characterization and Gaussian mixture models for classification. In addition, fusion of vowels at the score level is considered to improve detection performance. Results suggest that age might be effectively recognized using normal and pathological voices when sustained vowels are used as acoustic material, opening up possibilities for the design of automatic age-dependent voice pathology detection systems
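Score-level fusion of the per-vowel detectors mentioned above amounts to combining each vowel's score into one decision value. A minimal sketch, with function names and the weighted-mean rule chosen for illustration (the paper does not specify its fusion rule here):

```python
import numpy as np

def fuse_scores(vowel_scores, weights=None):
    """Score-level fusion: each sustained vowel (/a/, /i/, /u/, ...)
    yields one detector score per speaker; the fused score is their
    (weighted) mean. `vowel_scores` has shape (n_speakers, n_vowels)."""
    s = np.asarray(vowel_scores, dtype=float)
    if weights is None:
        w = np.ones(s.shape[1]) / s.shape[1]   # equal weights by default
    else:
        w = np.asarray(weights, dtype=float)
    return s @ w

def decide(fused, threshold=0.0):
    """Binary decision (e.g. adult vs. elder) from the fused score."""
    return fused > threshold
```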

    How Phonotactics Affect Multilingual and Zero-shot ASR Performance

    The idea of combining multiple languages' recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of a universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or from the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and that imposing too strong a model can hurt zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system's performance, and that retaining only the target language's phonotactic data in LM training is preferable. Comment: Accepted for publication in IEEE ICASSP 2021. The first 2 authors contributed equally to this work
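The AM/LM factorization that the hybrid system exposes can be illustrated with a toy scoring function: the total score of a phone sequence is the sum of per-phone acoustic log-probabilities plus a weighted phonotactic (bigram LM) log-probability. This is our own illustrative sketch, not the paper's decoder; zeroing `lm_weight` removes phonotactics entirely, mimicking the weak-LM condition the abstract discusses.

```python
def sequence_score(phones, am_logprob, lm_bigram, lm_weight=1.0):
    """Score a phone sequence under a factorized hybrid model.

    am_logprob: dict mapping phone -> acoustic log-probability.
    lm_bigram:  dict mapping (previous_phone, phone) -> LM log-probability.
    lm_weight:  interpolation weight for the phonotactic prior.
    """
    score = 0.0
    prev = "<s>"  # sentence-start token
    for ph in phones:
        score += am_logprob[ph]                     # acoustic evidence
        score += lm_weight * lm_bigram[(prev, ph)]  # phonotactic prior
        prev = ph
    return score
```

In a real decoder this score would be maximised over candidate sequences; here it only shows how the two knowledge sources combine additively in the log domain.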

    Towards the differential evaluation of Parkinson’s Disease by means of voice and speech processing

    Parkinson's Disease (PD) is a neurodegenerative condition that affects 1% of the population over the age of 60 in industrialized countries. This disease seriously affects a patient's independence and motor capabilities, having a considerable impact on their daily activities as it advances. Early detection can potentially slow the progression of PD but, unfortunately, the required period of clinical diagnosis ranges from months to years. Therefore, new objective and reliable tools are needed to support the differential diagnosis of the disease and to reduce this time. The analysis of a patient's speech has been shown to provide relevant information about the presence of the disease and, consequently, is a possible source of features to be used in diagnosis systems. Speech, an ability that is almost universal, involves coordination and precision of movements, mainly of the laryngeal and articulatory muscles. The purpose of this thesis is to propose and study different approaches to support the clinical diagnosis of PD, employing speech as the object of analysis. In this thesis, five sets of experiments are carried out, each one containing new approaches aimed at detecting the presence of the disease in the speech of idiopathic PD patients and controls from three different databases. Four of these experiments focus on the analysis of articulatory aspects, while the fifth employs phonatory features and a final combination of phonatory and articulatory information in a single approach. In these approaches, several state-of-the-art speaker and speech recognition technologies are employed in a different scenario: the automatic detection of PD from speech.
Several known feature families such as MFCC, PLP or LPC, as well as new features based on the modulation spectrum, are analyzed. Moreover, different speech frame selection techniques are proposed, such as allophonic distillation and acoustic landmark distillation, providing specific speech segments that are of interest for the purposes of this work. The main classification techniques employed are GMM-UBM and i-Vectors-GPLDA, along with new schemes such as the forced GMM. As a consequence of the analysis of the proposed approaches, the influence of PD on these specific segments is examined, allowing conclusions to be drawn about the functioning of parkinsonian dysarthric speech. These segments are phonetic groups related to the narrowing of the vocal tract or the use of the glottal source, such as fricatives, plosives or vowels, or mainly include transitions between phonetic units, such as the beginning of a burst or the end of a vowel, more related to the coordination of the articulators. The best accuracy results in the detection of PD achieved with the proposed methodologies reach values ranging from 85% to 94%, with an Area Under the Curve between 0.91 and 0.99, depending on the database. These results are obtained largely by employing approaches based on the proposed frame selection techniques: allophonic and acoustic landmark distillation. Likewise, it is concluded that the discriminatory properties of the proposed phonatory approaches for automatically detecting PD are quite limited in comparison with the analyzed articulatory approaches. Results suggest that PD affects the movements related to all of the studied articulatory segmental groups but has a clearer influence on the consonants with a greater narrowing of the vocal tract, mainly plosives and fricatives.
Finally, the new proposed methodologies demonstrate their ability to support the differential diagnosis of PD during a patient's clinical assessment and are a step forward for speech-based PD diagnosis systems employing articulatory aspects of speech

    On the design of automatic voice condition analysis systems. Part II: review of speaker recognition techniques and study on the effects of different variability factors

    This is the second of a two-part series devoted to the automatic condition analysis of pathological voices, a direct continuation of the paper On the design of automatic voice condition analysis systems. Part I: review of concepts and an insight to the state of the art. The aim of this study is to examine several variability factors affecting the robustness of systems that automatically detect the presence of voice pathologies from audio recordings. Multiple experiments are performed to test the influence of the speech task, extralinguistic aspects (such as sex), the acoustic features and the classifiers on their performance. Some experiments are carried out using state-of-the-art classification methodologies often employed in speaker recognition. In order to evaluate the robustness of the methods, testing is repeated across several corpora with the aim of creating a single system integrating the conclusions obtained previously. This system is later tested under cross-dataset scenarios in an attempt to obtain more realistic conclusions. Results identify a reduced subset of relevant features, which are used in a hierarchical-like scenario incorporating information from different speech tasks. In particular, for the experiments carried out using the Saarbrücken voice dataset, the area under the ROC curve of the system reached 0.88 in an intra-dataset setting and ranged from 0.82 to 0.94 in cross-dataset scenarios. These results open a discussion about the suitability of transferring these techniques to the clinical setting

    On the design of automatic voice condition analysis systems. Part I: review of concepts and an insight to the state of the art

    This is the first of a two-part series devoted to reviewing the current state of the art in automatic voice condition analysis systems. The goal of this paper is to provide the scientific community and newcomers to the field of automatic voice condition analysis with a resource that presents introductory concepts, a categorisation of different aspects of voice pathology and a systematic literature review describing the methodologies and methods most used in these systems. To this end, pathological voice is first described in terms of perceptual characteristics and its relationship with physiological phenomena. Then, a prototypical automatic voice condition analysis system is described, discussing each of its constituent parts and presenting an in-depth literature review about the methodologies that are typically employed. Finally, a discussion of some variability factors that affect the performance of these systems is presented

    SMIL to MPEG-4 BIFS conversion

    Traditional media, such as text, image, audio and video, have long been the main media resources and enjoy full support from standard desktop tools and applications. Interactive rich multimedia documents, which add resources such as video or synthetic animations and rely on complex synchronization among objects, are now entering the mainstream as new multimedia formats emerge. In this context, the Synchronized Multimedia Integration Language (SMIL) is receiving more and more attention from content authors due to its strong support for multimedia synchronization and interactive authoring. At the same time, MPEG-4 is designed to address the requirements of a new generation of highly interactive multimedia applications while simultaneously maintaining support for traditional applications. MPEG-4 provides facilities (XMT and BIFS) to integrate and synchronize, spatially and temporally, many different media objects. However, these facilities lack appropriate authoring tools, which narrows their audience and consequently limits their application. In this paper, we present a comparative analysis between SMIL and XMT, the textual description of MPEG-4, to illustrate the pros and cons of these two major interactive media formats. We then propose a conversion scheme from SMIL to the Binary Format for Scenes (BIFS) of MPEG-4 to take advantage of both formats. Following this scheme, we design a concrete implementation using currently available tools and discuss the purpose and significance of such a conversion.
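One concrete step any SMIL-to-BIFS converter must perform is resolving SMIL's nested timing containers into absolute start and end times before emitting scene updates. A toy sketch of that step, assuming a drastically simplified SMIL fragment (real SMIL timing, with begin/end attributes and clock values, is far richer; all names here are our own):

```python
import xml.etree.ElementTree as ET

SMIL = """<smil><body>
  <seq>
    <img src="a.png" dur="3s"/>
    <par>
      <audio src="b.wav" dur="5s"/>
      <video src="c.mpg" dur="4s"/>
    </par>
  </seq>
</body></smil>"""

def schedule(node, start=0.0):
    """Flatten <seq>/<par> nesting into (src, start, end) triples."""
    out = []
    if node.tag == "seq":
        t = start
        for child in node:
            items = schedule(child, t)
            out += items
            t = max(end for _, _, end in items)  # next child starts after this one
    elif node.tag == "par":
        for child in node:
            out += schedule(child, start)        # children start together
    else:
        dur = float(node.get("dur", "0s").rstrip("s"))
        out.append((node.get("src"), start, start + dur))
    return out

body = ET.fromstring(SMIL).find("body")
timeline = schedule(list(body)[0])
```

The resulting absolute timeline is what would then be mapped onto BIFS scene-update commands.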